On Weighted Multivariate Sign Functions
Multivariate sign functions are often used for robust estimation and
inference. We propose using data dependent weights in association with such
functions. The proposed weighted sign functions retain desirable robustness
properties, while significantly improving efficiency in estimation and
inference compared to unweighted multivariate sign-based methods. Using
weighted signs, we demonstrate methods of robust location estimation and robust
principal component analysis. We extend the scope of using robust multivariate
methods to include robust sufficient dimension reduction and functional outlier
detection. Several numerical studies and real data applications demonstrate the
efficacy of the proposed methodology.
Comment: Keywords: Multivariate sign, Principal component analysis, Data
depth, Sufficient dimension reduction
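As a rough illustration of the idea in this abstract, the sketch below computes the ordinary spatial sign of a point and a weighted version of it. The function names and the particular weight `bounded_distance_weight` are assumptions chosen for illustration only; the abstract does not specify the weights used in the paper.

```python
import math


def spatial_sign(x, center):
    """Unit vector pointing from center to x; the zero vector if x equals center."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return [0.0] * len(diff)
    return [d / norm for d in diff]


def bounded_distance_weight(x, center):
    """Hypothetical data-dependent weight in [0, 1): grows with distance but stays bounded."""
    dist = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
    return dist / (1.0 + dist)


def weighted_sign(x, center, weight_fn):
    """Spatial sign rescaled by a data-dependent weight (weight_fn is an assumption)."""
    w = weight_fn(x, center)
    return [w * s for s in spatial_sign(x, center)]


sign = spatial_sign([3.0, 4.0], [0.0, 0.0])
wsign = weighted_sign([3.0, 4.0], [0.0, 0.0], bounded_distance_weight)
```

Because the weight is bounded, a distant outlier still contributes a vector of length below one, which is the kind of robustness the unweighted sign provides, while the weight keeps some information about how far the point lies from the center.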
Generalized bootstrap for estimating equations
We introduce a generalized bootstrap technique for estimators obtained by
solving estimating equations. Some special cases of this generalized bootstrap
are the classical bootstrap of Efron, the delete-d jackknife and variations of
the Bayesian bootstrap. The use of the proposed technique is discussed in some
examples. Distributional consistency of the method is established and an
asymptotic representation of the resampling variance estimator is obtained.
Comment: Published at http://dx.doi.org/10.1214/009053604000000904 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
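A minimal sketch of the weighting idea, assuming the simplest estimating equation (the one defining a mean): each resample draws random weights and re-solves the weighted equation. Multinomial counts give Efron's classical bootstrap, and i.i.d. exponential weights give one variant of the Bayesian bootstrap. All function names here are hypothetical and this is not the paper's general procedure.

```python
import random


def solve_weighted_mean(xs, ws):
    """Root in theta of the weighted estimating equation sum_i w_i * (x_i - theta) = 0."""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)


def efron_weights(n, rng):
    """Multinomial(n; 1/n, ..., 1/n) counts, i.e. the classical bootstrap."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return counts


def bayesian_weights(n, rng):
    """i.i.d. Exponential(1) weights; after normalising they are Dirichlet(1, ..., 1)."""
    return [rng.expovariate(1.0) for _ in range(n)]


def bootstrap_variance(xs, weight_fn, n_boot=2000, seed=0):
    """Resampling variance of the estimator across n_boot weighted re-solves."""
    rng = random.Random(seed)
    n = len(xs)
    reps = [solve_weighted_mean(xs, weight_fn(n, rng)) for _ in range(n_boot)]
    mean_rep = sum(reps) / n_boot
    return sum((r - mean_rep) ** 2 for r in reps) / (n_boot - 1)


data = [float(i) for i in range(1, 11)]
var_efron = bootstrap_variance(data, efron_weights)
var_bayes = bootstrap_variance(data, bayesian_weights)
```

Only the weight generator changes between schemes, which is what makes it natural to treat them as special cases of one resampling method.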
Parametric bootstrap approximation to the distribution of EBLUP and related prediction intervals in linear mixed models
The empirical best linear unbiased prediction (EBLUP) method uses a linear mixed
model to combine information from different sources. This
method is particularly useful in small area problems. The variability of an
EBLUP is traditionally measured by the mean squared prediction error (MSPE),
and interval estimates are generally constructed using estimates of the MSPE.
Such methods have shortcomings like under-coverage or over-coverage, excessive
length and lack of interpretability. We propose a parametric bootstrap approach
to estimate the entire distribution of a suitably centered and scaled EBLUP.
The bootstrap histogram is highly accurate, and differs from the true EBLUP
distribution by only $O(d^3 n^{-3/2})$, where $d$ is the number of parameters
and $n$ the number of observations. This result is used to obtain highly
accurate prediction intervals. Simulation results demonstrate the superiority
of this method over existing techniques of constructing prediction intervals in
linear mixed models.
Comment: Published at http://dx.doi.org/10.1214/07-AOS512 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
Feature Selection using e-values
In the context of supervised parametric models, we introduce the concept of
e-values. An e-value is a scalar quantity that represents the proximity of the
sampling distribution of parameter estimates in a model trained on a subset of
features to that of the model trained on all features (i.e. the full model).
Under general conditions, a rank ordering of e-values separates models that
contain all essential features from those that do not.
The e-values are applicable to a wide range of parametric models. We use data
depths and a fast resampling-based algorithm to implement a feature selection
procedure using e-values, providing consistency results. For a $p$-dimensional
feature space, this procedure requires fitting only the full model and
evaluating $p+1$ models, as opposed to the traditional requirement of fitting
and evaluating $2^p$ models. Through experiments across several model settings
and synthetic and real datasets, we establish the e-values method as a
promising general alternative to existing model-specific methods of feature
selection.
Comment: accepted in ICML-202
Simultaneous Selection of Multiple Important Single Nucleotide Polymorphisms in Familial Genome Wide Association Studies Data
We propose a resampling-based fast variable selection technique for selecting
important Single Nucleotide Polymorphisms (SNP) in multi-marker mixed effect
models used in twin studies. Because of computational complexity, current practice
is to test the effect of one SNP at a time, commonly termed 'single
SNP association analysis'. Joint modeling of genetic variants within a gene or
pathway may have better power to detect the relevant genetic variants, hence we
adapt our recently proposed framework of e-values to address this. In this
paper, we propose a computationally efficient approach for single SNP detection
in families while utilizing information on multiple SNPs simultaneously. We
achieve this through improvements in two aspects. First, unlike other model
selection techniques, our method only requires training a model with all
possible predictors. Second, we utilize a fast and scalable bootstrap procedure
that only requires Monte-Carlo sampling to obtain bootstrapped copies of the
estimated vector of coefficients. Using this bootstrap sample, we obtain the
e-value for each SNP, and select SNPs having e-values below a threshold. We
illustrate through numerical studies that our method is more effective in
detecting SNPs associated with a trait than either single-marker analysis using
family data or model selection methods that ignore the familial dependency
structure. We also use the e-values to perform gene-level analysis in nuclear
families and detect several SNPs that have been implicated in alcohol
consumption.
Distribution-free cumulative sum control charts using bootstrap-based control limits
This paper deals with phase II, univariate, statistical process control when
a set of in-control data is available, and when both the in-control and
out-of-control distributions of the process are unknown. Existing process
control techniques typically require substantial knowledge about the in-control
and out-of-control distributions of the process, which is often difficult to
obtain in practice. We propose (a) using a sequence of control limits for the
cumulative sum (CUSUM) control charts, where the control limits are determined
by the conditional distribution of the CUSUM statistic given the last time it
was zero, and (b) estimating the control limits by bootstrap. Traditionally,
the CUSUM control chart uses a single control limit, which is obtained under
the assumption that the in-control and out-of-control distributions of the
process are Normal. When the normality assumption is not valid, which is often
true in applications, the actual in-control average run length, defined to be
the expected time duration before the control chart signals a process change,
is quite different from the nominal in-control average run length. This
limitation is mostly eliminated in the proposed procedure, which is
distribution-free and robust against different choices of the in-control and
out-of-control distributions.
Comment: Published at http://dx.doi.org/10.1214/08-AOAS197 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
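The two proposals (a) and (b) in this abstract can be sketched roughly as follows. The code tracks the upper CUSUM together with the time since it was last zero, then resamples the in-control data to estimate one control limit per elapsed time. The allowance `k`, the cap `j_max`, the quantile level `alpha`, and all function names are assumptions made for illustration; the paper's actual calibration of the limits is not given in the abstract and may differ.

```python
import random


def cusum_path(xs, k):
    """Upper CUSUM C_t = max(0, C_{t-1} + x_t - k), paired with the time since C was last zero."""
    c, sprint, out = 0.0, 0, []
    for x in xs:
        c = max(0.0, c + x - k)
        sprint = sprint + 1 if c > 0 else 0
        out.append((c, sprint))
    return out


def bootstrap_limits(in_control, k, j_max, alpha=0.01, n_steps=20000, seed=0):
    """Estimate a limit h_j for each elapsed time j as a (1 - alpha) quantile of resampled CUSUM values."""
    rng = random.Random(seed)
    draws = [rng.choice(in_control) for _ in range(n_steps)]  # resample with replacement
    by_sprint = {j: [] for j in range(1, j_max + 1)}
    for c, sprint in cusum_path(draws, k):
        if sprint >= 1:
            by_sprint[min(sprint, j_max)].append(c)  # pool elapsed times beyond j_max
    limits = {}
    for j, vals in by_sprint.items():
        if vals:
            vals.sort()
            limits[j] = vals[min(len(vals) - 1, int((1 - alpha) * len(vals)))]
    return limits


def first_signal(xs, k, limits, j_max):
    """First time (1-based) the CUSUM exceeds the limit for its elapsed time, or None."""
    for t, (c, sprint) in enumerate(cusum_path(xs, k), start=1):
        if sprint >= 1:
            h = limits.get(min(sprint, j_max))
            if h is not None and c > h:
                return t
    return None


gen = random.Random(42)
in_control = [gen.gauss(0.0, 1.0) for _ in range(200)]
limits = bootstrap_limits(in_control, k=0.5, j_max=5)
signal_time = first_signal([10.0] * 5, k=0.5, limits=limits, j_max=5)
```

No distributional form is assumed anywhere: the limits depend on the in-control data only through resampling, which is what makes a chart of this kind distribution-free.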